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ABSTRACT 

Efficient and automated classification of periodic variable stars is becoming 
increasingly important as the scale of astronomical surveys grows. Several re- 
cent papers have used methods from machine learning and statistics to construct 
classifiers on databases of labeled, multi-epoch sources with the intention of us- 
ing these classifiers to automatically infer the classes of unlabeled sources from 
new surveys. However, the same source observed with two different synoptic sur- 
veys will generally yield different derived metrics (features) from the light curve. 
Since such features are used in classifiers, this survey-dependent mismatch in 
feature space will typically lead to degraded classifier performance. In this paper 
we show how and why feature distributions change using OGLE and Hipparcos 
light curves. To overcome survey systematics, we apply a method, noisification, 
which attempts to empirically match distributions of features between the la- 
beled sources used to construct the classifier and the unlabeled sources we wish 
to classify. Results from simulated and real-world light curves show that noisifi- 
cation can significantly improve classifier performance. In a three-class problem 
using light curves from Hipparcos and OGLE, noisification reduces the classifier 
error rate from 27.0% to 7.0%. We recommend that noisification be used for 
upcoming surveys such as Gaia and LSST and describe some of the promises and 
challenges of applying noisification to these surveys. 

Subject headings: classification, periodic variables, machine learning, statistics, 
errors in variables 



1. Introduction 

Classification of periodic variables is crucial for scientific knowledge discovery and ef- 
ficient use of telescopic resources for source follow up (Eyer & Mowlavi 2008; Walkowicz 
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et al. 2009). As the size of synoptic surveys has grown, a greater and greater share of the 
classification process must become automated (Bloom & Richards 2011). With Hipparcos, 
it was possible for astronomers to individually analyze and classify each of the 2712 periodic 
variables observed in the survey. Starting in 2013, Gaia is expected to discover ~ 5 million 
classical periodic variables over the course of its 4-5-year mission (Eyer & Cuypers 2000). 
LSST, for that matter, may collect on the order of a billion (Borne et al. 2007). Individual 
analysis and classification by hand of all periodic variables is no longer feasible. 

The need for efficient and accurate source classification has motivated much recent work 
on applying statistical and machine learning methods to variable star data sets (e.g., Eyer 
& Blake 2005; Debosscher et al. 2007; Richards et al. 2011; Dubath et al. 2011). In these 
papers, classifiers were constructed using light curves from a variety of surveys, such as 
the Optical Gravitational Lensing Experiment (OGLE, Soszyhski et al. 2011), Hipparcos 
(Perryman et al. 1997), The All-Sky Automated Survey (ASAS, Pojmanski et al. 2005), the 
COnvection, Rotation & planetary Transits survey (CoRoT, Auvergne et al. 2009), and the 
Geneva Extrasolar Planet Search. Often the intention of these studies is to develop classifiers 
with high accuracy in classifying sources from surveys other than those used to construct the 
classifier. For example, Blomme et al. (2011) trained a classifier on a mixture of Hipparcos, 
OGLE, and CoRoT sources and used it to classify sources from the Trans-atlantic Exoplanet 
Survey (TrES, O'Donovan et al. 2009) Lyrl field. Dubath et al. (2011) and Eyer et al. (2008) 
view their work on classification of Hipparcos sources as a precursor to classification of yet- 
to-be collected Gaia light curves. Debosscher and collaborators trained a classifier on a 
mixture of OGLE and Hipparcos sources in attempts to classify CoRoT sources (Debosscher 
et al. 2007; Sarro & Debosscher J. 2008; Debosscher et al. 2009). 

It is well known that systematic differences in cadence, observing region, flux noise, 
detection limits, and number of observed epochs per light curve exist among surveys. Even 
within surveys there is heterogeneity in these characteristics. Most statistical classifiers 
assume that the light curves of a known class used to construct the classifier, termed training 
data, and the light curves of unknown class which we wish to classify, termed unlabeled data, 
share the same characteristics. This is unlikely to be the case when training and unlabeled 
light curves come from different surveys, or when the best-quality light curves of sources 
from each class are used to classify poorly sampled light curves of unknown class from the 
same survey. 

To illustrate how seriously survey mismatches can deteriorate classification performance, 
consider the three-class problem of separating Mira variables, Classical Cepheids, and Fun- 
damental Mode RR Lyrae from the Hipparcos and OGLE surveys. From OGLE, we use 
V-band data. Note that OGLE is far better sampled in I-band than V-band. We use V- 
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Fig. 1. — (a) The grey lines represent the CART classifier constructed using Hipparcos data. 
The points are Hipparcos sources. This classifier separates Hipparcos sources well (0.6% 
error as measured by cross-validation), (b) Here the OGLE sources are plotted over the 
same decision boundaries. There is now significant class overlap in the amplitude-fold2P 
plane (30% error rate). This is due to shifts in feature distributions across surveys. 

band to create a setting where one set of data is well sampled while the other set is poorly 
sampled. See Section 5.3 and Table 2 for more information on these sources. 

For each light curve we compute dozens of metrics, termed features, that contain im- 
portant information related to source class (e.g., frequency and amplitude; see Section 2 
for details on feature selection and extraction). Using the Hipparcos light curves we con- 
struct a classifier using CART. 1 The resulting classifier uses only two features for separating 
classes: the amplitude of a best fit sinusoidal model and the 90 th percentile of the slope 
between phase adjacent flux measurements after the light curve has been folded on twice the 
estimated period. 

Figure la displays these two features for each Hipparcos source with grey lines denoting 
the class boundaries chosen by CART. Based on the Hipparcos light curves, this looks like 
an excellent classifier as each of the three regions of feature space selected by CART contains 



1 CART (Classification And Regression Trees) is a popular classifier that forms a sequence of nested binary 
partitions of feature space. See Brciman et al. (1984) for more on CART. 
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sources of only one class. However, examining a subset of the OGLE sources, Figure lb, 
shows large class overlap on these two features. Here these two features do not separate 
OGLE sources well. The error rate measured by cross-validation on the Hipparcos sources 
was only 0.6% 2 . However, the misclassification rate on the OGLE sources is 30.0%. 

Despite what the 30.0% error rate seems to imply, the problem of separating classes in 
OGLE is not inherently difficult. A CART classifier trained on the OGLE light curves has 
a cross- validated error rate of 1.3%. While there are many systematic differences between 
the Hipparcos and OGLE surveys, their radically different cadences and number of flux 
measurements per light curve appear to be driving the increase in misclassification rate. For 
example, both features in Figure 1 depend on the estimate of each source's period; yet, over 
25% of the RR Lyrae in OGLE have incorrectly estimated periods due to poor sampling in 
the V-band. 

A natural question to ask is: If we had observed the Hipparcos sources at an OGLE 
cadence, what classifier would CART have constructed, and how would this have changed 
the error rate? In this paper we use noisification, a method which matches the cadence of 
training data and unlabeled data by inferring a continuous periodic function for each training 
light curve and then extracting flux measurements at the cadence and photometric error level 
present in the unlabeled light curves. The purpose of noisification is to automatically shift 
the distribution of features in the training data closer to the distribution of features in 
the unlabeled data so that a classifier can determine class boundaries as they exist in the 
unlabeled data. Versions of noisification were introduced in Starr et al. (2010) and Long et al. 
(2011). In this paper, we demonstrate that noisification improves classification accuracy on 
several simulated and real-world data sets. For instance, on the OGLE - Hipparcos three 
class problem we reduce misclassification rate by 20.0%. Performance increases are greatest 
when the training data is well sampled at a particular cadence while unlabeled light curves 
are either poorly time sampled or observed at a different cadence. 

This paper is organized as follows. In Section 2 we briefly outline the statistical classifi- 
cation framework and show how it is applied in the context of periodic variables. In Section 3 
we illustrate the problems that occur when training and unlabeled data come from different 
surveys. We present noisification, a method for overcoming differences related to number 
of flux measurements, cadence, and photometric error in Section 4. In Section 5 we apply 
noisification to several data sets. Finally in Section 6 we discuss possible uses of noisification 
for upcoming surveys. 



2 See 2.4 for a definition of cross-validation 
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2. Overview of Classification of Periodic Variables 

Here we review a methodology for constructing, implementing, and evaluating statistical 
classifiers for periodic variables. This approach has been used in many recent works. For 
a more detailed review of the methodology see Debosscher et al. (2007) or Richards et al. 
(2011). 

2.1. Constructing a Classifier 

We start with a set of light curves of known class, termed training data and a set of light 
curves of unknown class, termed unlabeled data. Our goal is to determine the classes for 
the unlabeled light curves using information present in the training data. Each light curve 
consists of a set of time, flux, and photometric error measurements. We compute functions 
of the time, flux, and photometric error, termed features. Features are chosen to contain 
information relevant for differentiating classes. The same set of features is computed for each 
light curve. A statistical classification method uses the training data to learn a relationship 
between features and class and produces a classifier C. Given the features, x, for a light 
curve in the unlabeled set, C(x) is a prediction of its class. 

2.2. Feature Set 

We use a total of 62 features to describe each light curve. 50 of these features are 
described in Tables 4 and 5 of Richards et al. (2011). 3 We use 12 other features, described 
in Appendix A of this article. Many of the features that we use are obvious choices e.g., 
frequency and amplitude. Most of our features, or features very similar to the ones here, 
have been used in recent work on classification of periodic variables (Kim et al. 2011; Dubath 
et al. 2011). 

2.3. Choosing a Classifier 

There are many statistical classification methods for constructing the function C. Some 
of the most popular include linear discriminant analysis (LDA), neural networks, support 
vector machines (SVMs), and Random Forests. In an earlier example we used CART. Each 



3 We do not use pair_slope_trend, max_slope, or linear_trend. 
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classification method has its own strengths and weaknesses. See Hastie et al. (2009) for an 
extensive discussion of classification methods. In this work we use the Random Forests clas- 
sifier developed by Breiman (2001), Amit & Geman (1997), and Dietterich (2000). Random 
Forests has been used, with high levels of success, in recent studies of automated variable 
star classification (Richards et al. 2011; Dubath et al. 2011). Richards et al. (2011), in a 
side-by-side comparison of 10 different classifiers using OGLE and Hipparcos data, found 
that Random Forest had the lowest error rate. 



2.4. Estimating Classifier Accuracy 

Usually, researchers want an estimate of how accurate the classifier, C, will be when 
presented with new, unlabeled data. Simply calculating the proportion of times C cor- 
rectly classifies light curves in the training data is a poor estimate of classifier success, as 
this typically overestimates classifier performance on unlabeled data. Better assessment of 
classifier performance on unlabeled data is attained by using training-test set splits or cross- 
validation. With training-test set splits a fraction of the data, usually between 10% and 30%, 
is "held out" while the rest of the data is used to train the classifier. Subsequently, the held 
out observations are classified and the accuracy recorded. This number provides an estimate 
of how well the classifier will perform on unlabeled observations. In cross-validation, the 
training-test split is repeated many times, holding out a different set of observations at each 
iteration. The accuracy of the classifier is recorded at each iteration and then averaged. See 
Chapter 7 of Hastie et al. (2009) for more information on assessing classifier performance. 
Cross-validation has been the method of choice for evaluating classifier performance in many 
of the recent articles on classification of periodic variables. 

3. Feature Distributions and Survey Systematics 

The classification framework described above comes with assumptions and limitations. 
Of critical importance, statistical classification methods are only designed to produce accu- 
rate classifiers when the relationship between features and classes is the same in training 
and unlabeled data. This is formalized as follows. Let z represent the class for a source 
with features x. Let pt r (z\x) be the probability of class given features in the training set and 
p u (z\x) be the probability of class given features for unlabeled data. Statistical classifiers 
are designed to have high accuracy when p tr {z\x) = p u (z\x). In the three class example in 
the introduction, we saw that this was not the case due, in part, to incorrect estimation 
of periods in the unlabeled (OGLE) light curves. Violating this assumption will also cause 
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cross-validation to make incorrect predictions of classifier accuracy. 

In this section we illustrate the complex connection between survey systematics and 
feature distributions. We show how this connection causes the p tr {z\x) = p u (z\x) assumption 
to break, potentially leading to poor classifier performance on the unlabeled data. 



3.1. Periodic Features 




Fig. 2. — (a) Distribution of frequency for three source classes observed for entire length 
of Hipparcos. (b) Distribution of frequency for same three sources classes observed for first 
365 days of Hipparcos. A classifier constructed on the the complete Hipparcos light curves 
is likely to have poor performance on the Hipparcos curves truncated to 365 days. This 
scenario could happen if Hipparcos light curves were used to construct a classifier that was 
then applied to short light curves from the first Gaia data release at 1-2 years into the 
mission. 

Nearly every study of classification of periodic variables has used period (or frequency) 
as a feature. Often in the training set, the period is correct for a large majority of sources 
due to the investigators selecting the highest quality light curves of each source class of 
interest. However, if periods are estimated incorrectly for the unlabeled data, then a classifier 
constructed on the training data may not capture the period-class relationship as it exists 
for the unlabeled data. 
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For example, it has been suggested that light curves from early Gaia data releases be 
labeled using classifiers trained on Hipparcos light curves (Eyer et al. 2008; Eyer & et al. 
2010). Figure 2a shows a density plot of the estimated frequency for three source classes 
in Hipparcos 4 using light curves from the entire 3.5-year survey. The median number of 
flux measurements per light curve is 91. However, one year into Hipparcos the densities of 
the estimated frequency for these source classes look significantly different (Figure 2b). The 
median number of flux measurements per light curve is now 29. Thus, even if we assume 
that Gaia and Hipparcos have similar survey characteristics, a classifier built on the 3.5-year 
baseline Hipparcos training set will not accurately capture the frequency-class relationship 
as it exists in 1-year Gaia data. This is due to incorrect estimates of frequency for the 
1-year length light curves. Since it is often the case that many features depend on frequency 
(e.g. Table 4 of Richards et al. (2011) and Section 4.5 of Dubath et al. (2011)), systematic 
differences in estimates of frequency can alter the distributions of many features. 



3.2. Time-Ordered Flux Measurements 
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Fig. 3. — Feature distributions can change dramatically with cadence. Plotted are the 
distributions of the P2PS feature (see equation (1)) for two simulated classes observed at (a) 
30 minute, (b) 2 day, and (c) 10 day cadences. A classifier trained on these light curves at 
one particular cadence may have poor performance when applied to light curves observed at 
a different cadence due to this change in feature distribution. 



Several recent studies of classification of periodic variables have used features that de- 
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pend on the time ordering of flux measurements. For example, Dubath et al. (2011) used 
point-to-point scatter (P2PS), the median of absolute differences between adjacent flux mea- 
surements divided by the median absolute difference of flux measurements around the me- 
dian. Specifically, given some light curve x with time ordered flux measurements mo, . . . rrik, 

P2PSW- M{{\ mi -m^\}U) 



M({\ mi - M({m^ =0 )}t ) 

where Ai denotes the median. While potentially useful for classification, the behavior of 
this feature is heavily dependent on the cadence of time sampling. To see this, consider a 
two class problem where class 1 is sine waves of amplitude 1 with period drawn uniformly 
at random between 0.25 days and 0.75 days and class 2 is sine waves of amplitude 1 where 
period is drawn uniformly at random between 2 days and 8 days. Say we observe 20 flux 
measurements for each source. Figure 3 shows the density of P2PS for 200 sources of each 
class with (a) 30 minutes, (b) 2 days, and (c) 10 days between successive flux measurements. 
At 30 minutes and 2 days the feature is useful for distinguishing classes, but in opposite 
directions. At 10 days the feature is no longer useful. 

The process of how cadence and period produce the P2PS feature density is complex. 
For class 2 (2 day to 8 day periods) at 30 minute cadence, the flux measurements for each 
source are often monotonically increasing or decreasing, producing a small numerator relative 
to denominator in equation (1). When the cadence is large relative to the distribution of 
periods for the source class, the functional shape of the light curve determines the P2PS 
density. In Figure 3c where the cadence is longer than any possible period for either class, 
the two classes have the same density because they have the same functional shape (sine 
waves). 

Note that this extreme sensitivity to cadence is not based on having 20 flux measure- 
ments per light curve. Running these simulations with 100 flux measurements per light 
curve produces densities of roughly the same shape. Rather, this example suggests how 
useful P2PS may be for distinguishing between classes in a setting where it may be difficult 
to determine a correct period (20 flux measurements per light curve), and how sensitive it 
is to systematic differences in cadence between training and unlabeled data. 



3.3. Time- Independent Features 

Finally, some of the most useful features for periodic variable classification are simple 
functions of flux measurements such as estimated amplitude, standard deviation, and skew. 
Figure 4 shows how estimated amplitude of Miras differs in distribution between the Hip- 
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Fig. 4. — Distribution of amplitude for Miras in OGLE and Hipparcos. Hipparcos Miras do 
not have very large amplitudes seen in some OGLE sources. The modes of the distributions 
are different as well. 

parcos and OGLE surveys. 5 In Hipparcos there are no Miras with amplitude greater than 3 
mag while roughly 12% of Miras in OGLE have amplitude greater than 3 mag. The mode 
of the densities is different as well. 

There are several possible causes for the difference in shape of these densities. The 
median difference between last observation time and first observation time for OGLE sources 
is 1902 days and 1142 days for Hipparcos. Since Miras vary in amplitude through each 
period, it is possible that OGLE is simply observing more periods and picking up on lower 
troughs and higher peaks than Hipparcos. Additionally, many OGLE sources have large 
mean photometric error (not shown), which may be driving up estimates of amplitude. 
Also, OGLE and Hipparcos sources were observed with different filters, possibly leading to 
biases in estimated amplitude. 

It is also worth noting that the Hipparcos catalog light curves are themselves a com- 
posite of Selected sources chosen for their scientific interest before the mission and a set of 
Survey sources which represent a nearly complete sample to well defined magnitude limits 
(which depend spectral type and galactic latitude). Figure 5 shows boxplots of amplitudes 



5 The Hipparcos Miras were used in (Dcbosscher et al. 2007). The OGLE sources are V-band data from 
OGLE III Catalog of Variable Stars: http://ogledb.astrouw.edu.pl/~ogle/CVS/ 
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Fig. 5. — Distribution of amplitude for Survey and Selected sources in Hipparcos. The 
Selected sources have systematically larger amplitudes than the Survey sources. 

in Hipparcos for classes with over 50 sources, blocked into Survey and Selected. The Selected 
sources appear to have larger amplitudes on average than the Survey sources. A statistical 
classifier trained on this data will discover class boundaries for this mixture of Selected and 
Survey sources. However if the unlabeled data resemble the Survey sources, these boundaries 
may not separate classes well. 



4. Noisification 



We have shown how differences in survey systematics can alter feature distributions and 
deteriorate classifier performance. These survey systematics exist between and within sur- 
veys. In this section we describe noisification, our solution to addressing training-unlabeled 
set differences. We use noisification to overcome differences in training-unlabeled feature 
distributions caused by differences in the number of flux measurements, cadence, and level 
of photometric error of light curves. Before introducing noisification we discuss a few recent 
works in the periodic variable classification literature that account for differences in training 
and unlabeled data and the extent to which they address distribution shifts discussed in 
Section 3. 
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4.1. Related Work 

Two recent works, Richards et al. (2012) and Debosscher et al. (2009), have adapted 
classifiers to address training-unlabeled data set differences by adding unlabeled data to the 
training set. Richards et al. (2012) applied an active learning methodology to successfully 
improve classifier performance on ASAS unlabeled data using OGLE and Hipparcos training 
data. Debosscher et al. (2009) used a method similar to self-training (Nigam & Ghani 2000) 
where after applying a classifier trained on Hipparcos and OGLE sources to CoRoT data, 
the most confidently labeled CoRoT sources were added to the training data. From this 
new training set, they constructed a classifier and used it to classify the remaining CoRoT 
sources. 

Both active learning and self-training are designed to work when the feature densities 
in training and unlabeled data are different, but the feature-class relationship is the same. 
More formally, if Pt r (x) and p u (x) are the feature densities in training and unlabeled data, 
then Active Learning and self-training are designed to address the setting where pt r (x) ^ 
£> u (x), not ptr{z\x) 7^ p u (z\x). However with our problem, differences in number of flux 
measurements, cadence, and photometric error induce different relationships between class 
and features. For instance, consider the P2PS cadence example in §3.2, Figure 3. If the left 
plot, (a), is the training data P2PS class densities and the center plot, (b), is the unlabeled 
P2PS class densities, then moving data from (b) to (a) (as is done with Active Learning 
and self-training) would produce class densities that are a mixture of (a) and (b). Training 
a classifier on a mixture of (a) and (b) densities is unlikely to produce a classifier that has 
high accuracy on data with the classes densities in (b). 

A method that comes closer to addressing class-feature distribution differences was used 
in Debosscher et al. (2009) to overcome aliasing in period estimation. There the authors 
found that the 13.97 -1 day orbital frequency of the CoRoT mission caused spurious spectral 
peaks and induced incorrect period estimation for sources. Their solution was to disregard 
spectral peaks at the orbital frequency. 

Effectively, Debosscher et al. (2009) asked the question "What would the value of this 
light curve's period feature have been if it had been observed at a cadence matching the 
training data." In their case, the answer is fairly staightfoward. However it is much less 
clear how to correct other features in a similar manner. If the unlabeled sources are observed 
for 10 days, then it is likely that estimates of amplitude are biased. But by how much? If 
the source is a Mira, then likely by a lot, but if the source is an RR Lyrae possibly not at 
all. So in order to correct amplitude estimates we need to know, or have some idea, of the 
class of the unlabeled source. But this returns to the goal of classification in the first place. 
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In Long et al. (2011) this approach was termed denoisification. For each unlabeled 
source the authors estimated a distribution across features representing uncertainty on what 
the feature values would have been if the source had been observed at a cadence, noise-level, 
and number of flux measurements in the training data. This distribution was combined 
with a classifier constructed on training data in order to classify unlabeled sources. While 
denoisification was superior to not adjusting for training-unlabeled distribution differences, 
the method did not achieve as large performance increases as noisification. 

Noisification overcomes training-unlabeled set differences by altering the training set 
so that the number of flux measurements, cadence, and photometric error match that of 
the unlabeled data. A classifier can then use this "noisified" training data to determine 
class boundaries as they exist for the unlabeled data. Noisification was introduced in Starr 
et al. (2010). Long et al. (2011) described a specific version of noisification appropriate 
for when training and unlabeled data have different numbers of flux measurements but are 
otherwise identical. Here we describe a far more general version of noisification which can 
be used across surveys when unlabeled sources have a systematically different number of 
flux measurements, cadence, and photometric error than the training data. Code written in 
Python and R is available for implementing noisification of light curves. 6 



4.2. Implementation of Noisification 

Given a set of training light curves, we first estimate a period for each. 7 Next, we smooth 
the period folded light curves, turning each set of flux measurements into a continuous 
periodic function. Select a light curve x from the training set, and then at random choose 
a light curve, / from the unlabeled set. Let g be the smooth periodic function associated 
with x. Let l^ijkp, and 1^ represent the time, flux and photometric error for epoch i of 
light curve I. Say there are m flux measurements for light curve I. We now extract flux 
measurements from the periodic function g matching the cadence and photometric error 
present in I. Specifically, if we let x^i^x^, and x i)3 be the time, flux, and photometric error 



6 Codc available here: http://stat.berkeley.edu/~jlong/noisification 

7 Noisification assumes we have training sources that are of high enough quality that we can estimate 
periods accurately. 
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of light curve x noisified to light curve Z, then we have, 



Xi,l 



h,i 

g(l isl + a) + €i 



(2) 



Xi,2 



for i e {1, . . . , m} where 




a is a phase offset drawn uniformly at random between and the period of g, p. This 
represents that fact that we are equally likely to start observing a source at any point in its 
phase, e, is the the photometric error added to each flux measurement. 

The cadence and level of photometric error in this new, noisified version of light curve 
x now match that of the unlabeled data. Repeat this process for every training light curve. 
Then derive features for the noisified training data, train a classifier on these observations, 
and classify the unlabeled light curves using this classifier. We call this process noisification 
because if our training data consists only of well-sampled light curves and our unlabeled data 
consists mainly of poorly sampled light curves, then the technique effectively adds noise to 
features in the training data to more closely match the characteristics of the unlabeled 
features. See Figure 6 for a concise description of the algorithm. 



1. smooth training light curves, turning them into continuous periodic functions 

2. extract flux measurements from these functions so that the number of flux mea- 
surements, cadence, and photometric error match the unlabeled data 

3. derive features from these altered {noisified) training data light curves 

4. construct a classifier using these light curve features 

5. apply classifier to unlabeled sources 



Noisification Algorithm 



Fig. 6. — Description of the light curve noisification algorithm. 



- 15 - 



4.3. Remarks on Noisification 

There are a few important points to note about this procedure. First, if the training and 
unlabeled data have the same cadence and photometric error, then smoothing the training 
light curves is not necessary. This would be the case, for example, if we had a set of training 
light curves of known class with many flux measurements (~ 100) from one survey and we 
wanted to classify an unlabeled set of poorly sampled light curves (~ 30 flux measurements) 
of similar cadence and photometric error level from the same survey as the training data. 
Then we could simply take the training light curves, truncate them at 30 flux measurements, 
train a classifier on the truncated curves, and apply this classifier to the unlabeled light 
curves. This setting has the added benefit that no error will be introduced by smoothing 
the light curves. In this case the training sources do not need to be periodic. 

Secondly, the procedure as described is most appropriate if all of the unlabeled data have 
similar numbers of flux measurements, cadence and photometric error. If this is not the case, 
then we can repeat the procedure several times using different subsets of the unlabeled data 
which share similar properties. For example, if unlabeled light curves have either around 20 
or around 70 flux measurements, then we could break the unlabeled data into two sets and 
classify each set using a separate run of the noisification procedure. The more subsets of the 
unlabeled data one uses, the closer the noisified training data gets to the unlabeled data. 
The tradeoff is computational burden. With n training light curves and m unlabeled light 
curves, noisifying to precisely match the properties of each unlabeled light curve requires 
deriving features for nm light curves. In Section 5 we explore how much one can gain from 
dividing the unlabeled data into subsets. 

With noisification, the unlabeled light curve, /, at which to noisify training light curve x, 
a and e are all random. Thus, repeating the noisification process several times and obtaining 
several classifiers offers potential for improvement in classifier performance over running the 
process once. We study this in Section 5. While building several classifiers may be a good 
idea, it is important not to train a classifier using several noisified versions of the same light 
curve as the training data would no longer be independent. This can cause classifiers to 
overfit the data, hurting classifier performance. 

Note that noisification is classifier independent. We use Random Forests in this work, 
but noisification can be used in conjunction with essentially any statistical classification 
method. Here we use Super Smoother for transforming training light curves into continuous 
periodic functions (Friedman 1984) 8 . The method used for inferring continuous training 



Fortran code here: http://www-stat.stanford.edU/~jhf/ftp/progs/supsmu.f. We used automatic 
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curves is separate from the the rest of the noisification process. Splines and Nadaraya- 
Watson methods are other possibilities. Splines are described in 5.4 of Hastie et al. (2009). 
See Hall (2008) for using Nadaraya-Watson with periodic variables. 

Finally we stress that this implementation of noisification is limited to addressing differ- 
ences between training and unlabeled sets caused by number of flux measurements, cadence, 
and photometric error. We do not correct for differences in feature distributions due to 
observing regions, detection limits, or filters. 



To get a sense how noisification performs in a controlled setting, we first test the method 
using training and unlabeled data from the same survey, but with systematically differing 
number of flux measurements. This resembles the real-life situation where well sampled light 
curves of known class are used as training data to classify poorly sampled curves of unknown 
class from the same survey. The cadence and levels of photometric error are assumed to 
match in the training and unlabeled data. We are also free from worrying about survey 
characteristics that noisification does not address. We perform two experiments, one using 
a simulated light curve data set and one using an OGLE light curve data set. 9 See Table 1 
for data set information. 



span selection (span= 0.0) and a high frequency penalty of a = 1.0. These choices were based on visual 
inspection of smoothing fits to light curves. 

9 Here the OGLE curves are in I-band. 



5. Experiments 



5.1. Noisification within a Survey 



Survey Source Classes" F / LC 6 



# Train # Unlabeled 



Simulated RR Lyrae, Cepheid, f3 Persei, 200-200 
(3 Lyrae, Mira 

OGLE c RR Lyrae DM, MM Cepheid, 261-474 
j3 Persei, Lyrae, WU Majoris 



500 500 



358 



165 



Table 1: Light curves used in Sections 5.1 and 5.2. 



a In the case of the simulated data, the light curves were made to resemble these classes. 



b F / LC is the first and third quartiles of flux measurements per light curve for training. 



c We use every light curves of these classes analyzed in Richards et al. (2011). 
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After splitting each data set into training and unlabeled sets, we downsample the light 
curves in the unlabeled data set to 10 through 100 flux measurements in multiples of 10. 
Now the unlabeled data sets resemble the training in every way except for the number of 
flux measurements per light curve. To each of the ten unlabeled data sets we apply four 
classifiers and compute classification accuracy on the unlabeled data sets. Figure 7 provides 
error rates for the four classifiers applied to the 10 unlabeled sets from (a) simulated and (b) 
OGLE. The four classifiers are: 



1. naive (black circles): Random Forest constructed on the unaltered training data 

2. unordered (red triangles): noisify every training light curve by matching the number 
of flux measurements in the training set and unlabeled set, but we choose a random, 
non-contiguous set of epochs (cadence information is lost) 

3. lx noisification (green plus): noisification without smoothing as described in Section 
4 

4. 5x noisification (blue x) "lx noisification" repeated five times as discussed in Section 
4 




Fig. 7. — Noisification results for (a) simulated light curves and (b) OGLE light curves. 5x 
Noisification (blue x) improves over making no adjustments for training-unlabeled data set 
differences (black circles) at all numbers of flux measurements. 
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The results in Figure 7 suggest that noisification can significantly increase classification 
performance when the unlabeled data is poorly sampled. With OGLE, "naive" misclassifies 
around 32% of light curves with 30 flux measurements while "5x noisification" misclassifies 
around 21%. Based on the difference between the "unordered" and "lx / 5x noisification" 
procedures, it appears that having a training cadence that matches the cadence of the unla- 
beled data can improve classification performance. We explore this in more detail later when 
training and unlabeled data come from surveys with different cadences. The "5x noisifica- 
tion" advantage over "lx noisification" is fairly modest. Repeatedly noisifying the training 
data and averaging the resulting classifiers reduces variance and leaves bias unchanged, so 
we see no way that using "5x noisification" instead of "lx noisification" could hurt classifier 
performance. For the remainder of the paper, noisification refers to"5x noisification." 



(a) 



(b) 
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Fig. 8. — Variable importances for the OGLE "lx noisified" classifier on (a) 10 flux mea- 
surement and (b) 100 flux measurement training sets. When the training data has few flux 
measurements non-periodic features are most important because periods cannot be estimated 
correctly. See Section 4.2 of Dubath et al. (2011) for an explanation of feature importance. 

To investigate how noisified classifiers differ, we plot feature importances for the "lx 
noisification" classifier for 10 and 100 flux measurements for the OGLE data (see Figure 8). 
Random Forest feature importance measures were introduced by Breiman (2001) and have 
been used in recent studies of periodic variables to gain an understanding of which features 
Random Forests considers most highly when assigning a class to a light curve. See Dubath 
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et al. (2011) Section 4.1 for a complete description of feature importance. Figure 8 shows 
that skew is very important for both classifiers. Notice that the 100 flux measurement classi- 
fier ranks several period based features as being important - scatter jres-raw, freqsignif, and 
freqlJiarmonics_freq_0 - while the 10 flux measurement classifier does not. The interpreta- 
tion is clear: when classifying light curves with 10 flux measurements, features that require 
a correct period will not be very useful. The process of noisifying light curves causes the 
classifier to recognize this and make use of class information present in other features. 



(b) 
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Fig. 9. — The 10-point, 50-point, and 100-point noisified classifiers applied to all of the (a) 
simulated and (b) OGLE unlabeled sets. The 50-point and 100-point noisified classifiers 
perform well on all the unlabeled data sets with more than 30 flux measurements for both 
simulated and OGLE. 

In these two examples, light curves in the unlabeled data set always had one of 10 
possible number of flux measurements (10, 20, . . . 100). The noisified light curves had exactly 
the same number of flux measurements as the unlabeled data. In practice, we will need to 
classify light curves with any number of flux measurements. It may be computationally 
challenging to construct noisified classifiers for every possible number of flux measurements. 
To test how sensitive error rates are to how light curves are noisified, we took the noisified 
classifiers for 10, 50, and 100 flux measurements and applied them across all 10 of the 
unlabeled data sets. Figure 9 shows the results for the (a) simulated and (b) OGLE data. 
We plot the error rates of these three classifiers along with the error rate of the classifier 
noisified to the number of flux measurements actually in the unlabeled data set (the "5x 
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noisified" classifiers from Figure 7). The results show that for these examples the error 
rates are fairly insensitive to exactly how many flux measurements we use in the noisified 
classifier. For the OGLE data, the classifier noisified to 10 flux measurements performs well 
until unlabeled light curves have around 70 flux measurements. Additionally the 50-flux 
and 100-flux noisified classifiers perform well for unlabeled data sets with between 30 and 
100 flux measurements. 



5.2. Noisification with Smoothing 

We now address the challenge of training a classifier on a survey with one cadence to 
classify light curves of a different cadence. In order to ensure that all differences between 
training and unlabeled data are due to issues addressed by noisification (number of flux 
measurements, cadence, photometric error) we use the simulated light curve prototypes 
from Section 5.1 for both training and unlabeled data sets. We sample the light curves at 
actual Hipparcos and OGLE light curve cadences used in previous studies (Richards et al. 
2011; Debosscher et al. 2007). 

Systematic differences exist between the OGLE and Hipparcos survey cadences. OGLE 
is a ground based survey with flux measurements taken at multiples of one day plus or minus 
a few hours. The sampling for these curves is quite regular with few large gaps. In contrast, 
Hipparcos light curves tend to be sampled in bursts, with several measurements over the 
course of less than a day followed by long gaps. 

In practice, one data set (say, Hipparcos) would be used to train a classifier in order to 
classify sources in the other (say, OGLE). However since these light curves are simulated, 
and we have labels for both sets, we create training and unlabeled data sets at Hipparcos 
and OGLE cadences so we can study the challenge of constructing a classifier on Hipparcos 
for use on OGLE sources and vice versa. We begin by generating 1000 simulated light curves 
using the class templates from Section 5.1. For 500 of these curves we randomly select an 
OGLE cadence and sample flux measurements and photometric errors from this cadence. 
We then take these 500 curves and downsample them to have 10, ... , 100 flux measurements 
in multiples of 10. The original 500 curves cadenced to OGLE is the OGLE training set, 
and the downsampled curves are the 10 OGLE unlabeled data sets. We repeat this process 
for the other 500 simulated curves at Hipparcos cadences. 

In order to test the efficacy and necessity of various aspects of the noisification process, 
we apply several classifiers to each of the unlabeled data sets. Figure 10 shows the accuracy 
of these methods treating (a) OGLE and (b) Hipparcos as the unlabeled data. For the left 
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plot with OGLE unlabeled light curves the classifiers are trained on: 

1. ogle cadence naive (black circle): unaltered OGLE light curves 

2. hipparcos cadence noisified (red triangle): Hipparcos light curves truncated to 
match length of unlabeled set, but not smoothed (cadence is different between training 
and unlabeled) 

3. hipparcos smoothed to ogle — noisified (green plus): Hipparcos light curves af- 
ter they have been smoothed, cadenced at OGLE, and truncated to match length of 
unlabeled curves 

4. ogle cadence noisified (dark blue x): noisified OGLE light curves (cadence already 
matches unlabeled set so smoothing unnecessary) 

5. hipparcos naive (light blue diamonds): unaltered Hipparcos light curves 

Not addressing cadence, flux measurement, and photometric error mismatches by train- 
ing on full length Hipparcos light curves leads to poor performance (light blue diamond). 
Noisifying these Hipparcos sources by truncation improves performance (red diamonds). 
However we gain significantly by correcting for cadence differences by smoothing (green 
plus). It is encouraging to see that by smoothing the Hipparcos training set and noisifying 
we can do as well as if we had started with OGLE cadence curves (dark blue x and green 
plus) . 

The right plot of Figure 10 displays the same information with Hipparcos as the unla- 
beled cadence. Note that the line markings have been changed to preserve relationship of 
training set to unlabeled set. The overall picture is similar to the OGLE data, except that 
convergence of error rates happens much more quickly. At 60 flux measurements there is 
little difference among any of the classifiers. 

The difference in error rates between classifiers trained on data noisified to the cadence of 
the unlabeled data and those that are not suggests that at low number of flux measurements 
feature distributions are different for the OGLE and Hipparcos cadences. To investigate 
this in Figure 11 we plot densities of amplitude for simulated light curves with 10 flux 
measurements at the OGLE and Hipparcos cadences. To keep things simple we show two 
class densities - Miras and not Miras. It is clear here that for the OGLE cadence amplitude 
is not a particularly useful feature for separating Miras from other sources whereas for the 
Hipparcos cadence it is. Due to the regular sampling at one to two day intervals, 10 flux 
measurement OGLE curves have only captured part of a Mira period. Hence the amplitude 
of the source looks much smaller than it actually is. In contrast the large gaps between 
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Fig. 10. — Simulated light curves where the unlabeled data is observed at a (a) OGLE or 
(b) Hipparcos cadence. By smoothing the training set and extracting flux measurements to 
match that of the unlabeled data (green plus), we improve performance over only matching 
number of flux measurements (red triangle). 



flux measurements in Hipparcos cadences result in us observing a much larger piece of phase 
space and thus obtaining a better estimate of amplitude. 



5.3. Using Hipparcos to Classify OGLE 

Now that we have studied noisification in some controlled settings, we test the method 
on the original problem proposed in Section 1. Recall that we are classifying Miras, RR 
Lyrae AB, and Classical Cepheids Fundamental Mode using light curves from Hipparcos 
as the training data and V-band OGLE as the unlabeled data. In Section 1 we saw that 
training a classifier on the Hipparcos curves and applying it directly to OGLE resulted in 
poor performance due, in part, to differences in number of flux measurements, cadence, and 
photometric error between the two data sets. 

Table 2 highlights some important differences between the Hipparcos and V-band OGLE 
sources. See Udalski et al. (2008); Soszynski et al. (2008, 2009a,b) for descriptions of OGLE 
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Fig. 11. — Amplitude feature distributions for Mira versus other classes for 10 flux measure- 
ments at (a) OGLE and (b) Hipparcos cadence. The feature is very useful for separating 
Miras from non-Miras at the Hipparcos cadence because of the irregular time sampling. Here 
we see how important it is for training and unlabeled data to have matching cadences, not 
just number of flux measurements. 

Ill photometry and these three source classes. 10 We use all OGLE III sources from the LMC 
belonging to the three classes of interest. 

There are systematically fewer flux measurements in OGLE than in Hipparcos. Unlike 
the previous example with I-band OGLE, the V-band OGLE curves here are fairly sparse. 
25% percent of the flux measurements are spaced 16 or more days apart. Perhaps the most 
striking difference between surveys is in the class proportions. RR Lyrae AB make up 26.6% 
of light curves in Hipparcos and 84.1% of light curves in OGLE. This is most likely due to 
Hipparcos magnitude limits which result in undersampling the intrinsically faint RR Lyrae 
AB relative to Mira and Classical Cepheids AB. 

To classify the OGLE sources, we noisify all the Hipparcos light curves to OGLE cadence 
at 10 through 100 flux measurements in multiples of 10. We then construct classifiers on 
each of these sets, resulting in 10 noisified classifiers. Each OGLE light curve is classified 



These OGLE III sources are available here: http : //ogledb . astrouw.edu.pl/~ogle/CVS/. 
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using the classifier with the closest number of flux measurements. So for an unlabeled OGhE 
light curve with 27 flux measurements, we classify it using the noisified classifier constructed 
on the 30- flux measurement training set. 

Table 3 displays a confusion matrix for the classifier constructed on the unmodified 
Hipparcos light curves when it is applied to the OGLE light curves. Table 4 shows the error 
rate using the noisification procedure. The overall error rate drops from 27% to 7% as a 
result of using noisification. This is driven by the drop in error rate for RR Lyrae AB (31% 
error using unmodified classifier, 7% after noisification) and the prevalence of RR Lyrae AB 
in OGLE. The error rate for Classical Cepheids F actually increases from 2% to 10% while 
for Miras it is roughly the same. 

Part of the reason why noisification increases the error rate for Classical Cepheids ap- 
pears due to differences in distribution of frequency caused by Hipparcos magnitude limits. 
Figure 12 displays frequency density in Hipparcos, 35-45 flux length OGLE, and Hipparcos 
noisified to 40 flux for Cepheids (12a), RR Lyrae (12b), and Miras (12c). Noisification has 
not changed the density at all for the Cepheid sources (the blue and orange density almost 
exactly overlap) for the Cepheids. Visual inspection of OGLE periods revealed that they 
were correct. This suggests that the frequency distribution for Cepheids is fundamentally 
different in Hipparcos and OGLE. This is likely due to magnitude limits in Hipparcos and 
OGLE. 

Lower frequency Cepheids are intrinsically brighter, so we can see them from further 
away. These low frequency Cepheids are over-represented in Hipparcos. In contrast OGLE is 
closer to a random sample of Cepheids in the Large Magellanic Cloud (LMC). If it is there, we 
see it. Since this survey difference is not caused by number of flux measurements, cadence, 
or photometric error, the current implementation of noisification does not correct for it. 



Survey # Sources Class Probs. a F / LC b Time DifF Error d 

Hipparcos e (training) 357 (0.45,0.27,0.28) 81-119 0.01-0.25 0.015-0.034 

OGLE (unlabeled) 20605 (0.09,0.84,0.07) 36-74 5.1-16.0 0.022-0.050 

Table 2: Training and unlabeled set characteristics for example in Section 1 and Subsection 
5.3. 

a Class probs. is the class proportion of (Classical Cepheids F, RR Lyrae AB, Mira). 

b F / LC is the first and third quartiles of flux measurements per light curve for training. 

c Time Diff is the first and third quartilcs of time differences in days between successive flux measurements. 
d Error is the first and third quartiles of estimated photometric error in magnitude for all flux measurements. 
e Light curves and classifications from Richards et al. (2011). 
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0.31 


Err. Rate 


0.75 


0.05 





0.27 



Table 3: Confusion matrix for classifier constructed on the unmodified Hipparcos light curves 
and applied to OGLE. Rows are true class and columns are predictions. The overall error 
rate is driven by the performance on the most abundant class, RR Lyrae AB. 

Predicted 
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0.07 


Err. Rate 


0.42 


0.05 


0.01 


0.07 



Table 4: Confusion matrix for classifier constructed on noisified Hipparcos light curves. Rows 
are true class and columns are predictions. The overall error rate has dropped to .07 from 
.26. This is due to better predicting RR Lyrae AB sources. The error rate on Classical 
Cepheids has actually increased. 

Notice that in Figure 12 right plot, the noisification procedure has shifted the distribution 
of RR Lyrae frequencies in Hipparcos to more closely match that in OGLE. Here much of 
the density mismatch was due to error in estimation of frequency due to having few flux 
measurements. Noisification helps us overcome this survey difference. 

Noisification is successful at matching other feature distributions. Figure 13 displays 
the densities of P2PS for each sources class in 13a Hipparcos, 13b OGLE, and 13c Hipparcos 
noisified. There is a great deal of difference between Hipparcos and OGLE densities. However 
the noisified Hipparcos source densities appear to closely match the densities of OGLE. 

6. Conclusions 

We have highlighted how differences between training and unlabeled light curves induce 
different feature distributions. We then showed how these shifts in distribution can cause 
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Fig. 12. — Density of frequency in OGLE light curves with 35-45 flux measurements (black 
solid), Hipparcos before noisification (blue dots) and after noisification to 40 flux measure- 
ments (orange dashed) for (a) Classical Cepheids F, (b) RR Lyrae AB, and (c) Miras. 
Noisification of Cepheid periods does not match training and unlabeled densities because 
of survey differences not related to number of flux measurements, cadence or photometric 
error. 

high error rates, even on problems where the unlabeled data is well separated in feature 
space. Common methods to evaluate classifier performance, such as cross-validation, do not 
detect these shifts in distribution and may give a false impression of classifier quality as they 
only reveal how well a classifier performs on data that is similar to the training set. 

We developed a methodology, noisification, for overcoming differences between training 
and unlabeled data sets. As implemented in this study, noisification addresses differences 
due to the number of flux measurements, cadence, and photometric error. On several sim- 
ulated and real-world examples, noisification greatly improved classifier performance. In 
the Hipparcos training-OGLE unlabeled example, noisification reduced the misclassification 
rate by 20%. 

We hope these findings motivate practitioners to carefully consider differences between 
training and unlabeled data sets. In general, we recommend using training sets that match 
as closely as possible the unlabeled set of interest rather than training sets that are high 
signal-to-noise. As demonstrated in many examples, high signal-to-noise light curves often 
work poorly as training sets when the unlabeled light curves are of low quality. This is due 
to the classifier discovering class boundaries in feature space as they exist in the training 
set, not as they exist in the unlabeled set. 

This study has made us skeptical of attempts to identify a single set of features that 
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Fig. 13. — (a) P2PS in Hipparcos un-noisified data. The feature appears useful for separating 
RR Lyrae from Miras and Classical Cepheids. (b) P2PS in OGLE for light curves with 
35-45 flux measurements. Now Classical Cepheids have nearly the same density as RR 
Lyrae. A classifier trained on the un-noisified Hipparcos light curves would not capture this 
relationship, (c) P2PS for Hipparcos light curves noisified to OGLE cadence with 40 flux 
measurements. The densities now closely resemble the OGLE light curves. 

is generically sufficient for separating a set of classes of periodic variables. Useful features 
change depending on how sources are observed. The Random Forest importance plots (Figure 
8) and the P2PS simulation (Subsection 3.2) illustrate this. When implementing noisifica- 
tion, we recommend starting with large feature sets, even including features that are not 
useful for separating classes in the training data. These features may become useful for 
separating classes once the light curves have been noisified. 

While we have studied noisification in the context of classification, it could also be ap- 
plied to other problems. For example, novelty detection and unsupervised learning (cluster- 
ing) methods are likely to work poorly when training and unlabeled data sets have systematic 
differences. Noisifying light curves offers a way to overcome these differences. 

Noisification may also be extended from what is implemented here to account for differ- 
ences not related to number of flux measurements, cadence, and level of photometric error. 
For example, known censoring thresholds in the unlabeled data could be incorporated into 
the training data by removing, or marking as censored, flux measurements which would not 
have been observed in the unlabeled data set due to magnitude limits. 

In the future, we will apply noisification to light curves from more surveys using larger, 
highly multi-class training sets. In parallel, we are developing a theoretical understanding of 
how noisification works and the problems for which it is most suitable. Of particular interest 
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is how noisification performs when there are survey differences not addressed by noisification. 
This was the case with the Cepheid frequencies in the three-class Hipparcos-OGLE problem. 

Upcoming surveys pose a challenge based in their size and their novelty. Not only will 
Gaia and LSST detect orders of magnitude more periodic variables than previous surveys, 
the sources they collect will have different properties than any training data we currently 
have. Noisification offers the potential to bridge some of these differences, enabling us to 
optimize scientific discovery. 

The authors would like to thank Laurent Eyer and Dan Starr for helpful comments and 
criticisms. The authors would like to acknowledge the generous support of a Cyber-Enabled 
Discovery and Innovation (CDI) grant (No. 0941742) from the National Science Founda- 
tion. This work was performed in the CDI-sponsored Center for Time Domain Informatics 
(http://cftd.info). 

A. Description of Features 

We used 62 features in this work. Fifty of these features came from Tables 4 and 5 
in Richards et al. (2011). We did not use the features pair_slope_trend, max_slope, or 
linear _trend from these tables. We used 12 additional features. Five are from Dubath et al. 
(20 11). 11 The remaining seven are: 

1. fold2P_slope_10percentile 10th percentile of slopes between adjacent flux measure- 
ments after the light curve has been folded on twice the estimated period 

2. fold2P_slope_90percentile 90th percentile of slopes between adjacent flux measure- 
ments after the light curve has been folded on twice the estimated period 

3. freq_frequency_ratio_21 ratio of the second to first frequency determined by lomb- 
scargle from Table 4 in Richards et al. (2011)) 

4. freq_frequency_ratio_31 ratio of the third to first frequency determined by lomb- 
scargle from Table 4 in Richards et al. (2011)) 

5. freq_amplitude_ratio_21 ratio of amplitude for frequency 2 to amplitude for fre- 
quency 1 (^j- from Table 4 in Richards et al. (2011)) 



11 scatter_res_raw, medperc90_2p_p. p2p_scatter_2praw. P2PS (named P2p_scatter in Dubath ct al. 
(2011)), and p2p_scatter_pfold_over_mad 
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6. freq_amplitude_ratio_31 ratio of amplitude for frequency 3 to amplitude for fre- 
quency 1 (jj^ from Table 4 in Richards et al. (2011)) 

7. p2p_ssqr_diff_over_var 12 the sum of squared differences in successive flux measure- 
ments divided by the variance of the flux measurements 
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